PyData Amsterdam 2017
Ivo Everts: Data Scientist at GoDataDriven, BSc+MSc+PhD in AI (biased towards Computer Vision).
Marcel Raas: Data Scientist at GoDataDriven, BSc+MSc+PhD in physics.
Gabriele Modena: former colleague who deserves kudos here, and not only because he is on the conference committee.
Copyright © 2017 GoDataDriven. All rights reserved. Unless otherwise indicated, all materials on these pages are copyrighted by GoDataDriven. All rights reserved. No part of these pages, either text or image may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system or retransmission, in any form or by any means, electronic, mechanical or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.
(but please use the code if you can;)
import util, os, multiprocessing as mp
util.small_canvas_size(); util.figsize();
# viz
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import gridspec, mlab
%pylab inline
splot = subplot
# num
import pandas as pd
from scipy import signal, stats, ndimage
from sklearn import *
Populating the interactive namespace from numpy and matplotlib
| id | age | gender | city | income |
|----|-----|--------|------|--------|
| 01 | 35  | m      | ams  | NULL   |
| 02 | 20  | m      | rot  | 40000  |
| 03 | 20  | f      | rot  | 35000  |
| 04 | 40  | f      | ams  | 50000  |
| .. | ... | ...    | ...  | ...    |

A signal can broadly be viewed as a function
$$f(\mathbf{t}) \rightarrow \mathbf{v}, \qquad \mathbf{t} \in \mathbb{R}^N,\ \mathbf{v} \in \mathbb{R}^M.$$
Thus, mathematical operators can be applied to signals - this is already quite different from traditional data!
Usually, t denotes a step in time (i.e. N=1) and v is some scalar measurement at time t (i.e. M=1). These are time series such as simple sensor measurements (temperature, humidity, seismic waves, people counts), financial transactions, etc. For images, t denotes the image coordinates (row, column) and v is the (R,G,B) pixel value.
# some quasi-periodic function definition
def fn(t):
    return sin(t) + cos(t*2) + 2*random.rand(len(t))
plot_f();
In order to represent this data for usage by a machine learning algorithm, we need to perform some sort of preprocessing and feature extraction.
# example
plot_inout_example();
There are challenges due to noise and signal length; the interesting features lie in the periodicity of the signal, which contrasts with simple global statistics such as the max and mean.
The preprocessing of and feature extraction from signals typically involves an operation called convolution, commonly denoted as $$(f * g)(t)$$ where f is our original function, and g is some other function.
Our data consists of a finite set of measurements at discrete steps, i.e. samples. This allows for a user-friendly mathematical notation of the convolution: $$(f*g)[\tau]=\sum_{t=-T}^T f[\tau-t]g[t].$$
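As a quick sanity check (not part of the tutorial code), the sum in this definition maps directly onto `np.convolve` — evaluating the sum by hand gives the same numbers as the library call:

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.5, 0.5])  # a simple averaging filter

# the sum from the definition, evaluated for every shift tau
manual = [sum(f[tau - t] * g[t]
              for t in range(len(g)) if 0 <= tau - t < len(f))
          for tau in range(len(f) + len(g) - 1)]

assert np.allclose(manual, np.convolve(f, g))  # identical results
```

Note that `np.convolve` already flips the filter for you, exactly as the `f[τ-t]` indexing in the sum does.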

Probably the most common operation in signal processing is smoothing - which is achieved by convolving the signal f with a (Gaussian) smoothing filter g. This can be used for noise reduction in e.g. audio signals, and blurring of images.
plot_gaussian_convolution_example()
# Basically any numericish package/module/language
# has functionality for performing convolution
# and associated kernel creation.
# It is an elementary operation for vector/matrix manipulation
# and thus relevant in particular to DSP.
plot_gaussians()
The Gaussian kernel is parameterized by its standard deviation, so we can control the amount of smoothing.
# go through the code
plot_multiscale_convolution_example()
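The plotting helper hides the details, but the idea can be sketched in a few lines (a sketch, using scipy's `gaussian_filter1d` rather than the tutorial's helpers): a larger sigma smooths more aggressively, which shows up as a lower-variance output.

```python
import numpy as np
from scipy import ndimage

rng = np.random.RandomState(0)
t = np.arange(256)
v = np.sin(t / 8.0) + rng.rand(len(t))  # noisy quasi-periodic signal

# larger sigma -> stronger smoothing -> less variation left in the output
for sigma in (1, 4, 16):
    smoothed = ndimage.gaussian_filter1d(v, sigma)
    print(sigma, round(np.std(smoothed), 3))
```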
An important statistical side note: given the observed density below, how would convolution come in handy?
plot_gaussian_kde_example()
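One answer (a sketch, not the hidden example code): a kernel density estimate is essentially the empirical distribution convolved with a Gaussian. Binning the samples and smoothing the spiky histogram gets you most of the way there:

```python
import numpy as np
from scipy import ndimage

rng = np.random.RandomState(0)
# samples from a bimodal distribution
samples = np.concatenate([rng.normal(-2, 0.5, 300),
                          rng.normal(1, 1.0, 700)])
# a spiky histogram estimate of the density...
hist, edges = np.histogram(samples, bins=200, density=True)
# ...convolved with a Gaussian becomes a (simple) kernel density estimate
kde = ndimage.gaussian_filter1d(hist, sigma=5)
```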
# example of smoothing a 2D signal - i.e. image blurring!
image = random.rand(100, 100)
for i, sigma in enumerate(sigmas):
    splot(1, len(sigmas), i+1);
    util.imshow(ndimage.filters.gaussian_filter(image, sigma));
    util.xlabel('sigma={:.2f}'.format(sigma))
# couple of options for showing a 2D gaussian filter
splot(131);
image = zeros((11, 11));
image[5, 5] = 1;
util.imshow(util.imscale(
ndimage.filters.gaussian_filter(image, 3)
))
splot(132);
x, y = meshgrid(range(101), range(101))
z = mlab.bivariate_normal(x, y, mux=50, muy=50,
sigmax=10, sigmay=10)
util.imshow(util.imscale(z))
splot(133, projection='3d').plot_surface(x, y, z);
So, I guess you are starting to recognize all that convolutional net imagery...!?

Besides smoothing, one of the most common applications of convolution on images is edge detection, which lies at the heart of many visual recognition systems.
city = util.imread('city.jpg');
forest = util.imread('forest.jpg');
splot(121); util.imshow(util.detect_edges(city));
splot(122); util.imshow(util.detect_edges(forest));
Let's start with less complex images
# not all digits are available in our mini sample
digits = [0,1,3,4,5,6,8,9]
for i in range(len(digits)):
    splot(2, 4, i+1);
    util.imshow(get_digit(digits[i]));
# now this should be fully understood (and appreciated;)
hybrid_image_plot(get_digit(5))
The image is just a 2D function $I = f(row, column), I \in [0, 255]$*, so we can apply math to it.
* a pixel is usually stored with 1 byte per channel, and converted to a float in [0,1] for ease of computation.
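To illustrate the footnote (a trivial sketch, assuming nothing beyond NumPy): the byte values are simply rescaled to the unit interval.

```python
import numpy as np

pixels = np.array([0, 64, 128, 255], dtype=np.uint8)  # 1 byte per value
as_float = pixels / 255.0                             # floats in [0, 1]
```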
Now we try to understand how edge detection works
digit = get_digit(8)
show_image_derivatives(digit)
These images are created by convolving with a derivative filter. The image gradient is defined as the vector $[\delta_x, \delta_y]$. Its magnitude $\sqrt{{\delta_x}^2+{\delta_y}^2}$ corresponds to the amount of contrast; its orientation $\arctan(\delta_y / \delta_x)$ corresponds to the edge direction.
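These formulas can be sketched directly (using scipy's `sobel` rather than the helper above, on a synthetic image): for a vertical edge, the magnitude peaks at the edge and the y-derivative vanishes.

```python
import numpy as np
from scipy import ndimage

image = np.zeros((9, 9))
image[:, 4:] = 1.0                  # vertical edge down the middle

dx = ndimage.sobel(image, axis=1)   # derivative along columns
dy = ndimage.sobel(image, axis=0)   # derivative along rows (zero here)
magnitude = np.sqrt(dx**2 + dy**2)  # amount of contrast
orientation = np.arctan2(dy, dx)    # edge direction
```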
A derivative filter: Sobel
smoothing = array([+1, +2, +1]).reshape(3,1)
derivative = array([+1, 0, -1]).reshape(1,3)
sobel_derivative = dot(smoothing, derivative)
splot(131);
util.imshow(util.imscale(sobel_derivative));
util.title('Derivative filter')
splot(132);
im = array([[1, 0.5, 0],[1, 0.5, 0],[1, 0.5, 0]])
util.imshow(im); util.title('Image')
splot(133);
util.imshow(util.imscale(
ndimage.convolve(im, sobel_derivative
)));
util.title('Image derivative')
Now for dy as well
show_image_derivatives(im)
Convolving a signal with a Gaussian filter smooths the signal. Convolving a signal with a 'difference' filter like Sobel sort of differentiates the signal. Convolving a signal with a Gaussian derivative filter...
show_gaussian_orders()
Now we can extract multiscale differential features from images
show_gaussian_scale_orders()
We can do the same with 1D signals
# finally, some custom code for a gaussian function ;)
def g(s=1, o=0, n=9):
    x = arange(n)-int(n/2)
    e = exp( - pow(x, 2) / (2*pow(s,2)) )
    p = sqrt(2 * pi)
    def g_0():
        return e / (s * p)
    def g_1():
        return -(e*x) / ( pow(s, 3) * p )
    return {0: g_0, 1: g_1}[o]()
plot(g(s=5,o=0,n=31)); plot(g(s=5,o=1,n=31));
util.legend(['Gaussian', 'Gaussian derivative'])
Sobel vs Gauss
v = rand(10) * 10
# Raw Sobel filtering
plot(ndimage.filters.sobel(v))
# Gaussian derivative
plot(convolve(g(s=1,o=1,n=3), v, 'same'))
plot(v)
util.legend(['Sobel derivative',
'Gaussian derivative',
'Raw signal'])
xticks(range(len(v))); # so that we can see clearly...
util.figsize()
Let's have a look at the following stock market graph:

The graph looks quite random, but there seems to be a repeating pattern. How could we determine the periodicity?
We could smooth the signal (Gaussian smoothing with parameter $\sigma$) to get rid of most of the noise, followed by making a distribution of the distances of the $k$ nearest neighbours. The period will then hopefully be visible as a peak in this distribution.
The Fourier transform tells you how to break up a signal into periodic components. You get all the frequencies/periods present in your signal, and it tells you how much each of them contributes to the total as well!
Note: Let's not try to fully understand the math, but rather focus on the concepts and Python implementations.
Clearly it has an application in sound data, where the periodicity corresponds to the pitch of a sound, e.g. notes in music and intonation in speech.

The transform extends to 2D data, so image data can also be decomposed into frequencies.
This is for instance used in image compression, where the data for frequencies that contribute little to the perceived picture is thrown away.
im = util.imread('images/jjfourier.jpg', edit_dir=True)
trans = fft.rfft2(im)
window_size = 10
trans[:, window_size:trans.shape[1]-window_size] = 0
util.image_subplots([im, fft.irfft2(-trans)])
$\text{Sinusoidal Wave} = \text{Amplitude} \cdot \sin(\text{Frequency} \cdot \text{Time} + \text{Phase})$
$f(t) = A \cdot \sin(2\pi{\cdot}f{\cdot}t + \phi)$
# What happens when you add 2 harmonics
HTML('<iframe src=https://betterexplained.com/examples/fourier/?cycles=0,1,1 width=600 height=280></iframe>')
Any function can be written as a sum of harmonics
X = linspace(0, 2*pi, 500)
y = sign(sin(X))  # e.g. a square wave...
f = sum(4/(pi*n) * sin(n*X) for n in (1, 3, 5, 7, 9))  # ...and a partial sum of its harmonics
plot(X, y, color="blue", label="y")
plot(X, f, color="red", label="f(t)")
legend(fontsize=15, loc='upper left');
A.k.a. the Fourier series $$f(t) = a_0 + \sum_{n=1}^\infty [a_n \cos(\omega_n t) + b_n \sin(\omega_n t)]$$
where $\omega_n$ are the harmonic frequencies and $a_n$, $b_n$ the corresponding amplitudes.
And Mr. Euler came up with $e^{i \varphi} = \cos \varphi + i \sin \varphi$, so we see the relationship between complex exponentials and harmonic functions.
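A quick numerical check of Euler's formula, and of how a real harmonic decomposes into two complex exponentials (a small aside, not from the slides):

```python
import numpy as np

phi = np.linspace(0, 2 * np.pi, 16)
# Euler: e^{i phi} = cos(phi) + i sin(phi)
assert np.allclose(np.exp(1j * phi), np.cos(phi) + 1j * np.sin(phi))
# hence a real harmonic is the sum of two complex exponentials
assert np.allclose(np.cos(phi), (np.exp(1j * phi) + np.exp(-1j * phi)).real / 2)
```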
The discrete Fourier transform of a real sequence of numbers will be a sequence of complex numbers of the same length
frequency = 5.0
phs = np.linspace(0, frequency * 2 * pi, 128, endpoint=False)
X = cos(phs)
fft_bins = rfftfreq(len(X), d=1/float(len(X)))
F = real(fft.rfft(X))
fft_plot((phs, X), (fft_bins, F))
noisy_X = add_gaussian_noise(X)
noisy_F = real(fft.rfft(noisy_X))
fft_plot((phs, noisy_X), (fft_bins, noisy_F))
Recall: a signal can be smoothed by convolving with a Gaussian filter.
def f(t):
    return sin(t) + cos(t*2) + 2*random.rand(len(t))
g = gaussian(3, 1)
t = arange(255)
v = f(t)
plot_convolution_result(g, t, v)
We can determine the frequencies in the signal by using the Fourier transform.
freq = rfftfreq(len(t), d=1/float(len(t)))
power_spectrum = absolute(fft.rfft(v))**2
idx = argsort(freq)
plot(freq[idx], power_spectrum[idx]);
title('Power spectrum of rfft(v)')
xlabel('freq'); ylabel('power');
Now let's look at the power spectrum of the convolved signal.
plot(freq[idx], power_spectrum_convolved[idx]);
title('Power spectrum of the convolution v * (g/sum(g))')
xlabel('freq'); ylabel('power');
As expected, the high frequencies are attenuated after smoothing the signal.
Now let's look at the power spectrum of the Gaussian filter
plot(freq[idx], power_spectrum_window[idx]);
util.title('Power spectrum of the Gaussian filter (zero - padded)')
util.xlabel('freq'); util.ylabel('power');
And finally: the product of the spectra of the filter and the original signal.
prod = power_spectrum_window * power_spectrum
plot(freq[idx], prod[idx]);
util.title('Product of the window and signal power spectra ')
util.xlabel('freq'); util.ylabel('power');
What the F!? This looks exactly the same as the spectrum of the filtered signal. Could this be a theorem?
The Convolution Theorem says that given two functions $f$ and $g$:
$$ \mathrm{FT}(f * g) = \mathrm{FT}(f) \cdot \mathrm{FT}(g). $$Meaning that convolution in the time domain is equivalent to multiplication in the frequency domain.
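We can verify the theorem numerically with NumPy (a sketch I've added here): for the discrete case, zero-pad both FFTs to the length of the full linear convolution and the two sides agree to machine precision.

```python
import numpy as np

rng = np.random.RandomState(42)
f = rng.rand(32)
g = rng.rand(16)

n = len(f) + len(g) - 1                    # length of the full linear convolution
lhs = np.fft.fft(np.convolve(f, g), n)     # FT(f * g)
rhs = np.fft.fft(f, n) * np.fft.fft(g, n)  # FT(f) . FT(g)
assert np.allclose(lhs, rhs)
```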
The FFT is applied on speech data in the last section of this tutorial.
Why are we not always using the Fourier Transform for representing signals? Come on, think about it: with the FT we can represent any signal precisely!
This is a classic: as opposed to precisely modelling the data, we nowadays collect examples and extract features for representing that data in a statistical formulation of the problem. A.k.a. machine learning.
Let's think about factors in the process of signal measurement against which the representation should be invariant.

In feature extraction for machine learning, we aim to find the optimal trade-off between discriminative power and invariance.
what is the most discriminative representation of a signal?
what is the most invariant representation of a signal?
what is other common terminology to denote this phenomenon?

We can represent a signal by its impulse response to a set of filters, commonly referred to as a filter bank. The filters are designed such that a signal's space/time and frequency characteristics are adequately measured.
A filter response represents the similarity between the signal and the filter, while no explicit assumptions are being made about the underlying data-generating process. This generally results in robust representations.
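A minimal illustration of "filter response = similarity" (synthetic signals, not the tutorial data): convolving a sinusoid with a filter of the same frequency yields a much stronger response than convolving it with a mismatched filter.

```python
import numpy as np

t = np.arange(0, 20, 0.1)
signal_5 = np.sin(5 * t)                         # signal oscillating at one rate

filt_match = np.sin(5 * np.arange(0, 2, 0.1))    # filter at the same frequency
filt_other = np.sin(13 * np.arange(0, 2, 0.1))   # filter at a different frequency

resp_match = np.convolve(signal_5, filt_match, 'valid')
resp_other = np.convolve(signal_5, filt_other, 'valid')
# the matched filter 'resonates' with the signal and responds more strongly
print(np.abs(resp_match).max(), '>', np.abs(resp_other).max())
```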
So let's have a look, and get some data first. We'll stick with 1 dimension for now.
signals, labels = util.make_datamarket_health_dataset()
plot_signals(signals, labels)
print('{} samples'.format(len(signals)))
660 samples
Interesting properties of these data are: min, max, mean, periodicity, ... So we should choose appropriate filters and extract features by convolution.
plot_signals(signals, labels)
A popular filter for capturing space/time/frequency characteristics is the Gabor filter. These are mostly well-known in image processing but generally apply to signals of any dimension.
The Gabor transform is one of many so-called band-pass filters that allow you to 'cut' the Fourier transform and isolate only specific information.
This is rather complex, but in the following we'll just assume that a Gabor filter is basically a sinusoid multiplied by a Gaussian, with which we can do cool stuff.
# create a gabor filter based on
# a gaussian with std s and
# a sinusoid with frequency f
def make_gabor_filter(s, f, sample_rate=None):
    sample_rate = s/10 if sample_rate is None else sample_rate
    t = arange(-s, +s, sample_rate)
    A = 1; p = 0;
    sinusoid = A * np.cos(f * t + p)
    gaussian = signal.gaussian(len(t), s**2)
    return sinusoid * gaussian
plot_filter(make_gabor_filter(pi/.5, 1, 0.1));
Now create a filter bank by considering a set of parameters for constructing Gabor filters
# create a filter bank from the given set of parameters
def make_gabor_filter_bank(params, sample_rate=1):
    return [make_gabor_filter(s, f, sample_rate)
            for (s, f) in params]
# make a sensible set of parameters for filter bank construction
def get_gabor_filter_bank_params(
        # default parameter set
        std=arange(pi/2, pi/.5, pi/2),
        freq=arange(1, 4, 1)):
    # joint parameter set
    x, y = meshgrid(std, freq)
    return array([x.flatten(), y.flatten()]).T
plot_gabor_filter_bank(get_gabor_filter_bank_params())
Let's move on to a classification round using filter banks: which signal represents which disease?
plot_signals(signals, labels);
For constructing fixed-length feature vectors, we need to convolve the signal and aggregate the filter response using some aggregator. This is known as feature pooling.
# default feature aggregation, a.k.a. pooling
default_feature_aggregators = [mean, std, max]
# apply a given set of functions to the given data
def apply_all(x, fs):
    return array([f(x) for f in fs])
# apply a filterbank to a signal
# and extract features using the provided aggregators
def apply_filter_bank(v, fb, aggregators):
    fb_result = [apply_all(np.convolve(v, f, 'valid'),
                           aggregators)
                 for f in fb]
    return array(fb_result).flatten()
For each of the 9 filters, we pool the result with a mean, std and max aggregator. This yields 27 numbers.
# build the filter bank
gabor_params = get_gabor_filter_bank_params()
fb = make_gabor_filter_bank(gabor_params)
# apply the filter bank to a signal
apply_filter_bank(signals[0], fb, default_feature_aggregators)
array([ 1495.23176041, 939.92067101, 3367.1978114 , -551.52858247,
1014.9391915 , 1443.31771911, -1694.35771167, 1292.39585964,
296.61427513, -484.51676414, 300.38765524, 50.11674388,
567.52926912, 430.0839714 , 1560.15736545, -547.93679011,
525.11152748, 336.15900611, -164.65933555, 108.19364065,
-22.36140403, -593.83879869, 312.31241484, -145.02812401,
446.11985246, 285.7219919 , 1035.56172359])
Train and test the features.
# apply the filterbank to all signals
def extract_gabor_features(signals, fb, aggregators):
    return array([apply_filter_bank(signal, fb, aggregators)
                  for signal in signals])
gabor_features = extract_gabor_features(
signals, fb, default_feature_aggregators)
gabor_features.shape
(660, 27)
# leave-one-out classification round with gabor features
result1 = util.l1o_model_validation(gabor_features, labels)
print('gabor features l1o: {:.0f}%'.format(100*result1))
gabor features l1o: 99%
Note: this is not necessarily the best approach with the best parameters for this problem, but we've come a long way in 1 try ;)
Also note that usually max-pooling results in the most discriminative representation.
# store the 1D gabor features for later usage
save('data/feats1D.npy',
hstack([gabor_features, labels.reshape((len(labels),1))]));
# gabor kernels are readily available in scikit-image
from skimage.filters import gabor_kernel
# prepare filter bank kernels
util.figsize(20, 8)
kernels, params = get_gabor_kernels(True)
for i in range(len(kernels)):
    splot(4, 4, i+1);  # 16 kernels assumed
    util.imshow(hstack([util.imscale(real(kernels[i])),
                        util.imscale(imag(kernels[i]))]))
    util.title(params[i])
util.figsize()
Let's extract Gabor features from an image...
wood1 = util.imread('images/vertical_wood.jpg');
wood2 = util.imread('images/horizontal_wood.jpg');
util.image_subplots([wood1, wood2])
...and look how they represent the image content.
wood1_gabor_feats = compute_kernel_feats(mean(wood1, axis=2),
kernels)
wood2_gabor_feats = compute_kernel_feats(mean(wood2, axis=2),
kernels)
plot_feat_hists(wood1_gabor_feats, wood2_gabor_feats)
util.legend(['vertical mean','horizontal mean'])
util.xlabel('Feature index')
util.title('Gabor features on horizontal and vertical textures')
Here, we'll see the concept of creating a visual vocabulary from local features, which can be used for constructing bag-of-visual-words image representations. This model will then be applied to a bunch of images of cats and dogs, using another popular type of image feature (SIFT) - which dominated the computer vision literature between Fourier and deep learning (roughly;).
Extract local features from an image
local_gabor_features = compute_local_kernel_feats(
mean(city, axis=2),
kernels,
descriptors_only=False)
locs = local_gabor_features[:, :3]
feats = preprocessing.StandardScaler().\
fit_transform(local_gabor_features[:, 3:])
util.imshow(city)
feats.shape
(16068, 16)
Compute bag-of-visual-words image descriptor
# do clustering of the features to construct 10 visual words
k = 10
gabor_vocab = cluster.KMeans(k).fit(feats)
# map the gabor features to their 'prototypes'
p = gabor_vocab.predict(feats)
image_descriptor = histogram(p, bins=k)[0]
# create the bovw descriptor: the distribution of local features over visual words
image_descriptor / sum(image_descriptor)
array([ 0.09366443, 0.04288026, 0.24166044, 0.06198656, 0.05072193,
0.11426438, 0.05146876, 0.21521036, 0.0238362 , 0.1043067 ])
Visualize a bunch of image patches that have been mapped to the same visual word.
word_nr = 6
l = random.permutation(locs[p==word_nr])
for i in range(len(l)):
    if i < 24:
        splot(3, 8, i+1)
        util.imshow(get_patch(city, l[i], zoom=1))
Now the same vocabulary can be used for representing any other image
bim = make_border_image(200)
bim_gabor_feats = compute_local_kernel_feats(
bim, kernels, descriptors_only=False)
bim_locs = bim_gabor_feats[:, :3]
bim_feats = preprocessing.StandardScaler().\
fit_transform(bim_gabor_feats[:, 3:])
util.imshow(bim)
bim_feats.shape
(900, 16)
And we can see that similar structures are mapped to the same word
bim_p = gabor_vocab.predict(bim_feats)
bim_l = random.permutation(bim_locs[bim_p==word_nr])
for i in range(len(bim_l)):
    if i < 24:
        splot(3, 8, i+1)
        util.imshow(get_patch(bim, bim_l[i], zoom=1))
And maybe this 3D Gabor application is a nice way to conclude the Gabor stuff.
Before the rise of deep learning with convolutional neural networks, the status quo in computer vision was constituted by the bag-of-visual-words model. This model involves 1) adequately representing local image features 2) constructing a visual vocabulary for vector quantizing the feature vectors and 3) training a classifier with the image features and associated labels (e.g. dog, cat). In the following we'll touch on the aspects involved.
For this occasion, I have hidden most of the code and I will skip the details because at the end of the tutorial I recommend you pip install Keras anyway ;)
Probably the most cited article in computer vision is David Lowe's paper on SIFT: the Scale Invariant Feature Transform. These are local image features that have dominated computer vision, and are typically better suited for general vision problems than Gabor features.
There's quite a lot to it - here we'll just mention the Feature Transform part: it uses features based on the image gradient, which we saw earlier in the tutorial. But scroll through the paper for the fun of it.
Note the importance of local features for robustness against geometric distortions due to scale, rotation, viewpoint and occlusion.
Let's just extract local image features using Mahotas, as that came up as easy and free to use - SIFT is patented, and Mahotas provides a convenient lib for extracting a fast approximation called SURF (elected in 2016 as ECCV's best paper of 10 years earlier).
# import surf from Mahotas
from mahotas.features import surf
ucity = util.float3d_to_uint2d(city)
uforest = util.float3d_to_uint2d(forest)
# detect feature locations and extract feature vectors
# let's not worry about all the possible parameter settings
city_features = surf.surf(ucity)
forest_features = surf.surf(uforest)
# Plotting the local features on the image canvas,
# we can see the different scales and orientations.
util.figsize(12, 12)
util.imshow(surf.show_surf(ucity, city_features[:250]))
util.figsize()
So now that we have a way to extract robust local features, we can proceed by learning the visual vocabulary. For this, we'll use dogs and cats.
print(len(labels), 'files,',
sum(labels), 'dogs,',
len(labels)-sum(labels), 'cats')
100 files, 50 dogs, 50 cats
util.image_subplots([util.imread(filepaths[1], edit_dir=False),
util.imread(filepaths[-2], edit_dir=False)])
util.image_subplots([util.imread(filepaths[0], edit_dir=False),
util.imread(filepaths[-1], edit_dir=False)])
That looks like a pretty hard problem
We'll be extracting dense features, because that usually performs better than sampling from keypoints (more=more). And, in addition to that, in this lecture we're using only 100 images but we need enough data to be able to construct the visual vocabulary.
# construct the vocabulary (may take a while)
num_words = 100
surf_vocab = create_visual_vocabulary(filepaths, num_words)
surf_vocab
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=100, n_init=10, n_jobs=-1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
def extract_surf_descriptor(filepath):
    global surf_vocab
    return extract_image_descriptor(filepath, surf_vocab)
with mp.Pool(6) as descriptor_pool:
    image_descriptors = array(descriptor_pool.map(
        extract_surf_descriptor,
        filepaths))
image_descriptors.shape
(100, 100)
Go bananas
# evaluate performance of cat/dog recognition
for cls in [KNeighborsClassifier(1), LogisticRegression(),
            SVC(kernel='rbf'), SVC(kernel='linear')]:
    res = util.l1o_model_validation(image_descriptors,
                                    array(labels),
                                    classifier=cls)
    print('{}\t{:.0f}%'.format(cls.__str__()[:50], 100*res))
KNeighborsClassifier(algorithm='auto', leaf_size=3	59%
LogisticRegression(C=1.0, class_weight=None, dual=	64%
SVC(C=1.0, cache_size=200, class_weight=None, coef	56%
SVC(C=1.0, cache_size=200, class_weight=None, coef	63%
Ok, well, let's just stick with this. In practice you will also have to do multi-scale sampling, cross-validate the number of visual words and model hyper-parameters, consider different implementations for feature extraction, use a richer BoVW model like VLAD or Fisher Vectors, add a spatial pyramid, etc., etc.
Because we are mortal
Having executed all those feature engineering tasks, we are left with big matrices of numbers that are difficult to interpret - usually more so than with 'traditional data'. So we should invoke some learning step that identifies the relevant features. Here, we consider Principal Component Analysis and Linear Discriminant Analysis.
Load the features and labels from previous sections
# best cat/dog model in previous run was a linear SVM
bovw_cls = SVC(kernel='linear')
gabor_features, disease = load_numpy_dataset('data/feats1D.npy')
gabor_res0 = util.l1o_model_validation(gabor_features, disease)
bovw_features, pet = load_numpy_dataset('data/feats2D.npy')
bovw_res0 = util.l1o_model_validation(bovw_features, pet,
classifier=bovw_cls)
print('gabor0: {:.2f}\nbovw0: {:.2f}'.\
format(gabor_res0, bovw_res0))
gabor0: 0.99
bovw0: 0.63
With PCA we can obtain an unsupervised dimensionality reduction, while LDA is a supervised technique.
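A toy sketch of the difference (synthetic data I made up here, assuming scikit-learn): PCA picks the direction of largest variance regardless of the labels, while LDA picks the direction that actually separates the classes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
# two classes separated only along a low-variance direction
X = rng.randn(200, 2) * [10, 1]
y = (rng.rand(200) > 0.5).astype(int)
X[y == 1, 1] += 4

pca_1d = PCA(n_components=1).fit_transform(X)                    # ignores y
lda_1d = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

# class separation (mean gap / std) along each 1D projection
gap = lambda z: abs(z[y == 1].mean() - z[y == 0].mean()) / z.std()
print(gap(pca_1d.ravel()), '<', gap(lda_1d.ravel()))
```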

Using increasing amounts of principal components for classifying diseases. With a small number of features we get almost optimal performance!
# this takes a while
plot_progressive_pca_validation(gabor_features,
disease, 'disease')
Using increasing amounts of principal components for classifying pets. With a small number of features we get better performance than with all features!!
plot_progressive_pca_validation(bovw_features,
pet, 'pet',
classifier=bovw_cls)
# use PCA for selecting 2 features
pca_feats = decomposition.PCA(n_components=2) \
.fit_transform(gabor_features)
# use LDA for selecting 2 features
lda_feats = discriminant_analysis.LinearDiscriminantAnalysis() \
.fit_transform(gabor_features, disease)
pca_lda_plot(pca_feats, lda_feats, disease)
Making the features better linearly separable does not help so much when using a nearest neighbour classifier!
Use a linear SVM to boost performance on LDA features (a bit)
print('PCA+SVM: {:.0f}%\nLDA+SVM: {:.0f}%'.format(
100*util.l1o_model_validation(pca_feats, disease,
svm.SVC(kernel='linear')),
100*util.l1o_model_validation(lda_feats, disease,
svm.SVC(kernel='linear'))))
PCA+SVM: 83%
LDA+SVM: 93%
Plus, we see that performance on the PCA-ed feats decreases!
We have learned in this tutorial what convolution and pooling are. A CNN consists of several more layers for mapping input signals to output classes, providing an end-to-end learning mechanism for both the representation and the classification function.

The input layer of a CNN consists of filters that are used to extract features from the input signal by convolution. For images, the learned filters resemble the Gabor filters quite a lot.

Ok, so now we don't have to put engineering effort in feature extraction, but we have to think about / experiment with:
The following describes one such combination of choices
The MNIST handwritten digits dataset is a common benchmark for image classification and ships with Keras itself:
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
util.image_subplots((squeeze(X_train[0,:,:]),
squeeze(X_train[1,:,:])))
Using TensorFlow backend.
The data is further preprocessed (not shown), and with Keras it is easy to build a CNN:
from keras.models import Sequential
model = Sequential()
input_shape = (28, 28, 1)
n_filters = 32
filter_size = 3
pool_size = 2
We can create a filter_size * filter_size convolutional layer as an instance of Convolution2D
from keras.layers.convolutional import Convolution2D
conv_layer = Convolution2D(n_filters, filter_size, filter_size,
border_mode='valid',
input_shape=input_shape)
model.add(conv_layer)
Add an activation layer (which determines whether a signal is sent to the next layer) and a pooling layer:
from keras.layers import Activation
from keras.layers.convolutional import MaxPooling2D
model.add(Activation('relu'))
model.add(Convolution2D(n_filters, filter_size, filter_size))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(pool_size, pool_size)))
And finally map the convolution and pooling results to the output categories (0-9):
from keras.layers import Flatten, Dense
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(n_classes))
model.add(Activation('softmax'))
Train the model, using batch_size samples per iteration and observing the whole dataset nb_epoch times.
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(X_train, y_train,
batch_size=512, nb_epoch=2,
validation_data=(X_test, y_test));
Train on 60000 samples, validate on 10000 samples
Epoch 1/2
60000/60000 [==============================] - 86s - loss: 0.3595 - acc: 0.8976 - val_loss: 0.1078 - val_acc: 0.9685
Epoch 2/2
60000/60000 [==============================] - 79s - loss: 0.0865 - acc: 0.9749 - val_loss: 0.0623 - val_acc: 0.9805
score = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', score[1])
Test accuracy: 0.9805
There are deep models trained on millions of images and classes, right underneath your fingertips. We can use those to classify images, or to extract features and use those for our classification problem.
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
Using the dog/cat images still pointed to by filepaths, extract deep features using a pretrained CNN from Keras, with special thanks to Oxford's Visual Geometry Group.
model = VGG16(weights='imagenet', include_top=False)
with mp.Pool(6) as vgg_pool:
    vgg_feats = array([extract_vgg_features(img_pp(filepath),
                                            model)
                       for filepath in filepaths])
Show us the money!
util.l1o_model_validation(vgg_feats, pet)
0.53000000000000003
...what a disappointment!
Recall that we are using the output of (usually) the one-but-last layer, so we probably have to take one more transformation step. Let's use PCA.
ncs = arange(5)+1
pca_res = []
for nc in ncs:
    pca = decomposition.PCA(n_components=nc)
    pca_res.append(util.l1o_model_validation(vgg_feats,
                                             pet,
                                             normalizer=pca))
With PCA applied to the deep features, we achieve almost perfect performance!
plot(ncs, pca_res);
xlabel('# principal components'); ylabel('# accuracy');
ylim([0.88,1]); xticks(ncs);
title('Cat vs Dog recognition with VGG deep feats and PCA');
(Like, OMG!!##!? WTF????!!&%$! is this real)
scatter(pca_vgg_feats[labels==0,0],
pca_vgg_feats[labels==0,1], c='b');
scatter(pca_vgg_feats[labels==1,0],
pca_vgg_feats[labels==1,1], c='r');
legend(['cats','dogs'], fontsize=15, loc='upper left');
Only if there is time left, otherwise just try this at home.
# standard audio sample rate (samples per second)
sample_rate = 44100.
From a couple of speakers we have 2 recordings and associated metadata, so we can build models from speech data to infer that metadata.
from metadata import *
print(list(people.keys()))
people['Ivo']
['Marcel', 'Ivo2', 'Andrew', 'Nelli', 'Gabriele', 'Ivo', 'Jelte', 'Ron']
{'age': 35, 'gender': 'male', 'mood': 'neutral', 'nationality': 'Dutch'}
sentences
{'s1': 'I want to implement a speech recognition system',
's2': 'the fourier transform is super cool but rather complex'}
I recorded the audio on my Mac with QuickTime, resulting in m4a files. With ffmpeg you can convert between many audio, image and video formats. For this tutorial I shipped only the wav files.
# point to folder where data is stored
data_folder = 'data/speech_samples/'
def m4a2wav(data_folder):
    m4a_list = [os.path.join(data_folder, file_name)
                for file_name in os.listdir(data_folder)
                if file_name.endswith('m4a')]
    for m4a_file in m4a_list:
        os.system('ffmpeg -y -i {} {}'.format(
            m4a_file, m4a_file[:-3]+'wav'))
# convert all m4a files
for person in people.keys():
    m4a2wav(os.path.join(data_folder, person))
IPython provides awesome utilities
from IPython.display import *
# play a wav file
def play_sound(data):
    global sample_rate
    if type(data) == str:
        return Audio(filename=data,
                     autoplay=False, rate=sample_rate)
    else:
        return Audio(data.astype(int16),
                     autoplay=False, rate=sample_rate)
play_sound(os.path.join(data_folder, 'Ivo', 's1.wav'))
Load the speech data
from scipy.io import wavfile
# return the first (mono) channel, normalized so the loudest sample is +/-1
def read_wave(filepath):
    sample_rate, audio_channels = wavfile.read(filepath)
    mono = audio_channels[:, 0]
    return mono / float(np.max(np.abs(mono)))
s1 = read_wave(os.path.join(data_folder, 'Nelli', 's1.wav'));
s2 = read_wave(os.path.join(data_folder, 'Gabriele', 's1.wav'));
Plot the signals
subplot(121); plot(s1); title('s1 pronounced by Nelli');
xlabel('time'); ylabel('normalized amplitude');
subplot(122); plot(s2); title('s1 pronounced by Gabriele');
Extract features using the Fast Fourier Transform
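`fourier_transform` is defined earlier in the notebook (code not shown in this section); a minimal numpy sketch consistent with how it is used below might look like this:

```python
import numpy as np

sample_rate = 44100.

# hypothetical sketch of the fourier_transform helper
def fourier_transform(wave):
    # power spectrum of the real-valued signal
    power = np.abs(np.fft.rfft(wave)) ** 2
    # frequency (Hz) for each FFT bin, given the sample rate
    freq = np.fft.rfftfreq(len(wave), d=1. / sample_rate)
    return power, freq
```

For a pure 440 Hz tone, the returned power spectrum peaks at the 440 Hz bin.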
p1, f1 = fourier_transform(s1); p2, f2 = fourier_transform(s2);
plot(f1, p1, label='Nelli'); plot(f2, p2, label='Gabriele');
xlim([0, 1000]); legend(fontsize=15);
xlabel('frequency'); ylabel('power');
It looks like we can use the frequency distributions to discriminate between Nelli and Gabriele. Let's chop the spectrum into bins for a compact representation.
# extract the amount of power in a given frequency range
def power_at(power, freq, lbound, hbound):
    lbin = argmin(abs(freq - lbound))
    hbin = argmin(abs(freq - hbound))
    # TODO: correct for non-exact localization
    return sum(power[lbin:hbin])
Create a dataset
# dataset params
freq_step = 100
freq_max = 1000
lbounds = arange(0, freq_max, freq_step)
# store features and labels
features = []
labels = []
# keep waves in memory for later usage
waves = []
for person in people.keys():
    for s in sentences.keys():
        w = read_wave(os.path.join(data_folder, person,
                                   '{}.wav'.format(s)))
        p, f = fourier_transform(w)
        features.append(array([power_at(p, f, b, b + freq_step)
                               for b in lbounds]))
        labels.append([person,
                       people[person]['nationality'],
                       people[person]['gender'],
                       people[person]['mood'],
                       people[person]['age'],
                       s])
        waves.append(w)
Create dataframe (code not shown)
dataset.head(5).T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| f0 | -0.421025 | -0.404128 | -0.43439 | -0.414063 | -0.180684 |
| f1 | -0.628028 | -0.880296 | 0.704927 | 0.819942 | -0.84289 |
| f2 | 0.272792 | -0.325877 | -0.661536 | -0.301881 | -1.24012 |
| f3 | -0.583158 | -0.949314 | 1.89822 | -0.490668 | -0.330657 |
| f4 | -0.932387 | -0.898587 | 2.30568 | 0.19562 | -0.616166 |
| f5 | -0.850168 | -0.661366 | 3.16831 | 0.24307 | -0.861628 |
| f6 | 0.247941 | -0.525531 | 0.648278 | -0.244405 | -1.36581 |
| f7 | 0.874248 | -0.533401 | 1.78762 | 1.67368 | -0.9359 |
| f8 | 2.5395 | 0.624118 | -0.491731 | -1.10561 | -0.777151 |
| f9 | 3.39059 | -0.301294 | -0.0401555 | 0.528934 | -0.627082 |
| name | Marcel | Marcel | Ivo2 | Ivo2 | Andrew |
| nationality | Dutch | Dutch | Dutch | Dutch | Australian |
| gender | male | male | male | male | male |
| mood | neutral | neutral | angry | angry | neutral |
| age | 33 | 33 | 36 | 36 | 40 |
| sentence | s2 | s1 | s2 | s1 | s2 |
| n_name | 5 | 5 | 3 | 3 | 0 |
| n_nationality | 1 | 1 | 1 | 1 | 0 |
| n_gender | 1 | 1 | 1 | 1 | 1 |
| n_mood | 1 | 1 | 0 | 0 | 1 |
| n_age | 2 | 2 | 4 | 4 | 5 |
| n_sentence | 1 | 0 | 1 | 0 | 1 |
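The dataframe construction is not shown; the integer-encoded `n_*` columns could be derived with sklearn's `LabelEncoder`, e.g. (toy rows for illustration, not the actual data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy rows mirroring the dataset layout above
df = pd.DataFrame({'name': ['Marcel', 'Ivo2', 'Andrew'],
                   'gender': ['male', 'male', 'male'],
                   'sentence': ['s2', 's1', 's2']})

# add an integer-encoded n_* column per categorical label
for col in ['name', 'gender', 'sentence']:
    df['n_' + col] = LabelEncoder().fit_transform(df[col])
```

`LabelEncoder` assigns integers in sorted order of the unique values, which matches the pattern of the `n_*` rows above.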
lda_plot()
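`lda_plot` is defined earlier in the notebook (code not shown); the core of such a helper, projecting the features onto the two most discriminative LDA directions before scattering them, might be sketched as:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_project(X, y):
    # supervised projection onto the 2 most discriminative directions;
    # requires at least 3 classes for a 2D projection
    return LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
```

The resulting two columns are what would be passed to `scatter`, colored by class.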
for i, f_spike in enumerate(f_spikes):
    subplot(2, 2, i + 1)
    extend = 150 * int(params[i, 0])
    mid = len(f_spike) // 2  # integer division, required for indexing in Python 3
    plot(f_spike[mid - extend:mid + extend])
    xticks([]); yticks([]); title(params[i, :])
Look at an example of local features over time
feats = extract_local_features(waves[0], aggs=[max, mean])
imshow(feats.T); colorbar();
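`extract_local_features` is defined elsewhere (code not shown); a plausible sketch slides a window over the signal, computes the FFT power per window, and summarizes each window with the given aggregation functions:

```python
import numpy as np

def extract_local_features(wave, window=1024, step=512, aggs=(np.max, np.mean)):
    # one row of aggregated spectral statistics per window position
    feats = []
    for start in range(0, len(wave) - window + 1, step):
        power = np.abs(np.fft.rfft(wave[start:start + window])) ** 2
        feats.append([agg(power) for agg in aggs])
    return np.array(feats)
```

The window and step sizes are assumptions; the notebook's actual helper may differ.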
# code not shown
bow_feats.shape
(16, 10)
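The bag-of-words construction is not shown; a common recipe clusters all local features with k-means to form a "vocabulary" and then histograms the cluster assignments per recording. A sketch (not necessarily the exact code used here):

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_words(local_feats_list, n_words=10):
    # learn a vocabulary of prototypical local features
    km = KMeans(n_clusters=n_words, n_init=10, random_state=0)
    km.fit(np.vstack(local_feats_list))
    # one histogram of word counts per recording
    return np.array([np.bincount(km.predict(f), minlength=n_words)
                     for f in local_feats_list])
```

With 16 recordings and 10 words this yields the (16, 10) shape shown above.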
Use the LDA plot to illustrate the bow features
lda_plot(bow_feats)
Do a round of classification on the FFT and BoW features, each in raw/PCA/LDA versions.
fft_bow_cls_wrapper()
*** name *** fft: 0.38/0.31/0.38, bow: 0.12/0.12/0.12
*** nationality *** fft: 0.81/0.81/0.62, bow: 0.44/0.50/0.31
*** gender *** fft: 0.94/0.94/0.88, bow: 0.75/0.75/0.62
*** mood *** fft: 0.75/0.69/0.69, bow: 0.62/0.69/0.62
*** age *** fft: 0.44/0.38/0.44, bow: 0.12/0.12/0.19
*** sentence *** fft: 0.56/0.56/0.56, bow: 0.69/0.88/0.62
On these data, the FFT-based features perform best on most targets; only for sentence recognition do the BoW features do better.
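`fft_bow_cls_wrapper` is not shown; its core, cross-validated accuracy on raw versus PCA-reduced features, could be sketched as follows (the classifier and component counts are assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def cls_scores(X, y, cv=4):
    # mean cross-validated accuracy on raw and on PCA-reduced features
    clf = KNeighborsClassifier(n_neighbors=3)
    raw = cross_val_score(clf, X, y, cv=cv).mean()
    pca = cross_val_score(make_pipeline(PCA(n_components=2), clf),
                          X, y, cv=cv).mean()
    return raw, pca
```

The actual wrapper additionally reports an LDA-reduced variant and loops over all six label columns.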
A CNN can also be trained on 1D signals such as speech. Here it does not make much sense due to the small amount of data, but with larger datasets this is a strong approach.
In the following, we'll build a 1D CNN for discriminating between the two sentences.
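The model definition is not shown; its first layer's core operation, a 1D convolution followed by a ReLU, can be sketched in numpy (with hypothetical, hand-fixed filter values rather than learned ones):

```python
import numpy as np

def conv1d_relu(wave, filt):
    # Keras-style "convolution" is actually cross-correlation;
    # mode='valid' drops border positions where the filter overhangs
    out = np.correlate(wave, filt, mode='valid')
    return np.maximum(out, 0)  # ReLU activation
```

In the real model the filter values are learned by gradient descent; this sketch only illustrates the forward pass of one filter.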
After preprocessing the data (not shown) we can train the model
# fit the model using train and validation data
model.fit(pwaves[:8], keras_labels[:8],
          batch_size=4,
          nb_epoch=5,
          validation_data=(pwaves[8:12], keras_labels[8:12]));
Train on 8 samples, validate on 4 samples
Epoch 1/5
8/8 [==============================] - 1s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 4.9573 - val_acc: 0.5000
Epoch 2/5
8/8 [==============================] - 1s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 4.9898 - val_acc: 0.5000
Epoch 3/5
8/8 [==============================] - 1s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 5.0168 - val_acc: 0.5000
Epoch 4/5
8/8 [==============================] - 1s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 5.0393 - val_acc: 0.5000
Epoch 5/5
8/8 [==============================] - 1s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 5.0579 - val_acc: 0.5000
Training accuracy is perfect while validation accuracy is at chance level, so the model has overfit. Maybe it still does some good on held-out data after all?
res = model.evaluate(pwaves[12:], keras_labels[12:], verbose=0)
print('Accuracy = {}'.format(res[1]))
Accuracy = 0.5
Not really, but that's OK given only 16 recordings.
We can visualize the filters that have been learned. We set the number of filters to 2 and the filter size to 100 samples.
filters = model.layers[0].get_weights()
for i, f in enumerate(filters[0].reshape(filter_size, num_filters).T):
    plot(f, label='filter {}'.format(i + 1))
legend();
And we can see that a 100-sample snippet of a wave sort of resembles the filters:
plot(pwaves[0][900:1000]);
Thank you and have a great time at PyData.
ivoeverts@godatadriven.com